Abstract

We outline a novel means to index all of the data on every 
computing device in an enterprise including devices well out- 
side the reach of all currently available search engines. The 
invention resolves the problems of indexing intermittently 
connected devices, indexing large numbers of individual devices,
and indexing device-specific content not normally considered
amenable to indexing. Further, the invention restricts
indexing to authenticated search engines, enforces selective 
access to the information resolved by a query, and includes 
secure data transmission during indexing, query, and content 
inspection. 

1 Introduction 

Large scale web search requires a crawler, an indexer, and 
a query engine. Crawlers, as their name implies, "crawl" 
across a web following links and fetching web pages for the 
indexer, a massive compute- and storage-intensive applica- 
tion that constructs a comprehensive inverted index of every 
web page uncovered by the crawler. That index may contain 
significant words or phrases extracted from the content of the 
web page, metatags embedded within the HTML proper, and 
inferences made from the link structure of the page both out- 
going and incoming. 

The query engine, the application employed by end users, 
searches the indices constructed by the indexer to return links 
to candidate web pages in response to the query keywords and
other criteria (language, domain of origin, age, ...) provided
by the user. A search engine is a triumvirate of crawler, indexer,
and query engine as illustrated in Figure 1. Web search
is employed within, and by, an enterprise to assist employees,
customers, and business partners in locating the enterprise
information that they need.

Crawlers, like browsers, contact web servers for the pages 
that they are fetching and returning to the indexer. With 
rare exception these web servers are hosted on capacious ma- 
chines with ample processing power and storage space. How- 
ever, the number and form of computing devices within an en- 
terprise is expanding. These devices vary in capacity, speed, 
platform, and method of network connection — ranging from 
corporate servers to laptops to personal digital assistants to 
embedded sensor and control networks. The dominance of
networks and the proliferation of devices tend to push data
storage out to the edges of the corporate network, yet the data
and documents residing on these edge devices are often
inaccessible to crawlers and indexing engines:

• No links refer to either the devices or their contents

• Some devices are only intermittently connected to the
network

• The protocol assumed by the crawler (HTTP) is unsup- 
ported by the device 

• The content of the device is in a format unknown to the 
indexing engine 

Enterprises realize that much valuable and sensitive infor- 
mation resides in the devices at the edges of the network, and 
they attempt to capture that data with various mechanisms 
including shared file systems, enterprise-wide data reposito- 
ries, knowledge management systems, and centralized data 
archives. Despite their widespread use these methods fail to 
capture much of the data on desktop, mobile, and personal 
devices. 

Figure 1: The component parts of a search engine and their relationships.

Shared network volumes may capture only a fraction of
the data on a desktop, and the software required to support
access is unsuitable for small, mobile devices. Data repositories
frequently assume explicit submission; there is no guarantee
that the repository contents are either current or comprehen- 
sive and the repository indexing (to the extent that it exists 
at all) may rely solely on keywords provided by the submit- 
ter. Knowledge management systems often rely on propri- 
etary formats for content, restrict content to a small number 
of formats, or are specialized for a narrow domain, hence 
are ineffective for indexing and searching a broad array of 
corporate information sources. Often architected as centralized
systems, they are ill-suited for capturing the ephemera of
network-centric mobile devices. Finally, while corporate data
archives are an effective means of retaining information of
durable, lasting, and proven value, they are ineffectual in
capturing the valuable but short-lived data created on desktop,
portable, and mobile computing devices.

Sections 2-4 offer a solution to the problem of bringing 
previously inaccessible information into view of enterprise 
search engines. Section 2 presents a model of Web crawlers, 
one of the three fundamental components of modern web 
search engines. Section 3 introduces the Magi thin server 
technology and explains, using the taxonomy of Section 2, 
how the Magi server both extends the reach of search engines
to devices that were until now inaccessible and expands the
flexibility, power, and timeliness of Web search engines.
Finally, Section 4 presents (informally) the specific claims for
our invention.



2 Taxonomy 

Formally, a search engine is a triple $(C, I, Q)$ of three separate,
but related, applications: crawler, indexer, and query
engine respectively [1].

Each may be characterized by action (what is done), locale 
(where it is done), and time (when it is done). In this manner 
a taxonomy of search engines can be constructed that
characterizes the range of variation available to search engine
components with respect to action, locale, and time. In the
discussion below we consider only the crawler of the search
engine. Future patents will address the interactions between
Magi servers and the indexer and query engine of a
search engine.



2.1 Crawler 

Crawlers fetch web pages for submission to the indexer of 
the search engine. Given an URL u a crawler C must first 
decide whether or not to visit the web page designated by u. 
If affirmative, the crawler can reach u if: 

• C can connect to the host of u 

• C has the authority to access the page designated by u 

Connectivity and access authority are two separate, but
related, considerations. The first is the ability to establish a
TCP connection between the crawler and the web server and
varies with the position of the crawler within the network
(for example relative to a firewall) or the network quality of
service (such as congestion or routing anomalies).

Once a connection is established the crawler may be re- 
quired to obtain permission to read the page (for example, it 
may be password protected). Access may also be restricted to 
a finite set of hosts whose identity is determined by inspecting 
the source IP address of the packet stream or by cryptographic 
means. 

Finally, once the crawler has the page content in hand it
must extract whatever links it can for the next round of
crawling. Extraction depends on the form and semantics of the
content and the extractors available to the crawler. For example,
all crawlers can (by definition) extract links from HTML
pages; however, few crawlers have the extractors required to
lift links embedded within PDF or Microsoft Word documents.

Thus a crawler is characterized by: 

• The IP address of the crawler host which limits, with 
respect to network topology, routing, and firewalls, the 
remote hosts to which the crawler can connect 

• Access authority 

• A loading policy that gleans URLs of value from the set 
at hand 

• An extraction policy that determines if the contents of a 
web page will yield URLs 

• A set of extractors for extracting URLs from various
forms of content

With this as background we present a formal model of a
crawler that reflects the components and process outlined 
above. We adopt, with minor amendments, the definitions 
and notation of [2]. 

Notation 1 Let an URL $u$ be given, then $p_u$ denotes the
HTTP response for $u$ (nominally the page contents of $u$). If
$U$ is a nonempty set of URLs then $p_U = \bigcup_{u \in U} p_u$.

For the sake of conciseness take $p_u$ to denote both the URL
$u$ and the page contents of $u$.

Definition 1 An extractor is a function $g(p_u)$ such that either
$g(p_u) = \bot$, meaning that $g$ is ill-defined on $p_u$, or $g(p_u) = S$
where $S$ is the set (possibly empty) of all links (URLs) $v$
extracted from $p_u$.

Definition 2 An extraction policy $E(p_u)$ is a decision procedure
that returns true if links (URLs) can be extracted from
$p_u$ and false otherwise.

$E$ may inspect the URL $u$, the MIME type of $u$ (contained
within the HTTP response), and the page contents $p_u$ since all
offer valuable hints as to the format and structure of $p_u$. For
example, if the MIME type is "text/html" then $p_u$ is a page whose
structure is well defined (by the HTML specification) and
amenable to the extraction of links. If the MIME type is
unspecified (the HTTP response omitted the Content-Type
header field) then the crawler may examine the syntax of the
URL or the content itself to infer the media type. For example,
the URL suffix .wav or .au indicates (by common convention)
an audio file, which may contain links (rendered as
speech) but whose extraction by machine agents is problematic
at best. However, some audio formats make provision for
the inclusion of digital metadata and the crawler, if equipped
with a suitable extractor, may be able to extract that metadata
for the benefit of the indexer.
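
The decision procedure described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names Response, EXTRACTABLE_TYPES, infer_media_type, and extraction_policy are hypothetical, and the standard mimetypes module stands in for the URL suffix conventions mentioned in the text.

```python
import mimetypes
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Assumption: media types for which this hypothetical crawler owns a link extractor.
EXTRACTABLE_TYPES = {"text/html", "application/xhtml+xml"}


@dataclass
class Response:
    """Stand-in for p_u: the HTTP response for URL u."""
    url: str
    content_type: Optional[str]  # value of the Content-Type header, if present
    body: bytes


def infer_media_type(resp: Response) -> Optional[str]:
    """Infer the media type from the header, then the URL suffix, then the body."""
    if resp.content_type:
        # Drop parameters such as "; charset=utf-8".
        return resp.content_type.split(";")[0].strip().lower()
    guessed, _ = mimetypes.guess_type(urlparse(resp.url).path)
    if guessed:
        return guessed
    # Last resort: peek at the start of the content itself.
    head = resp.body[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return None


def extraction_policy(resp: Response) -> bool:
    """E(p_u): True if links can be extracted from p_u, False otherwise."""
    return infer_media_type(resp) in EXTRACTABLE_TYPES
```

Under these assumptions an audio URL ending in .wav with no Content-Type header is rejected, while an untyped response whose body begins with an html tag is accepted.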

Definition 3 A loading policy is a decision procedure L(u) 
that returns true if URL u is deemed suitable for loading and 
false otherwise. 

A page loading policy determines whether a crawler ignores
robot-excluded pages and generated pages (such as those
produced by CGI scripts) and honors page loading, resource, and
time limits with respect to a site or domain. Other considerations
may also play a role in the formulation of $L$.
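
A loading policy of this kind might look as follows in Python. This is a sketch under stated assumptions: make_loading_policy is a hypothetical helper, generated pages are recognized only by a query string or a /cgi-bin/ path segment, and the page limit is a simple per-site counter. Robot exclusion uses the standard urllib.robotparser module.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def make_loading_policy(robots_txt: str, user_agent: str, page_limit: int):
    """Build L(u) for one site from its robots.txt text and a page budget."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    loaded = 0  # number of URLs approved so far

    def loading_policy(url: str) -> bool:
        nonlocal loaded
        parts = urlparse(url)
        if parts.query or "/cgi-bin/" in parts.path:
            return False  # skip generated pages (assumed heuristic)
        if not robots.can_fetch(user_agent, url):
            return False  # honor robot exclusion
        if loaded >= page_limit:
            return False  # honor the per-site page limit
        loaded += 1       # an approval counts against the budget
        return True

    return loading_policy
```

Note that each approval consumes part of the budget, so the policy is stateful; a crawler that wants a pure predicate would track the count elsewhere.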

Definition 4 Let an IP address $a$, an URL $u$, and a set
of access permissions $P$ (including passwords, encryption
keys, and access procedures) be given. The access function
$A(a, u, P)$ returns $p_u$ if and only if it is possible to access $u$
from $a$ and $P$ grants sufficient authority.

Definition 5 A crawler $C$ is a tuple $\langle a, A, P, E, G, L \rangle$
where:



• a is the location (IP address) of C 

• A is an access function 

• P is a set of access permissions 

• E is an extraction policy 

• G is a nonempty set of extractors 

• L is a loading policy 
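
The tuple of Definition 5 maps naturally onto a record type. The following Python sketch is illustrative only: the type aliases, the use of None for an undefined extractor result or a failed access, and the field types are assumptions made here, not part of the original definitions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Set, Tuple

Page = bytes                                       # p_u, the response for URL u
Extractor = Callable[[Page], Optional[Set[str]]]   # g; None means "undefined on this page"
AccessFn = Callable[[str, str, FrozenSet[str]], Optional[Page]]  # A(a, u, P)


@dataclass(frozen=True)
class Crawler:
    a: str                     # location (IP address) of the crawler
    A: AccessFn                # access function
    P: FrozenSet[str]          # access permissions
    E: Callable[[Page], bool]  # extraction policy
    G: Tuple[Extractor, ...]   # nonempty collection of extractors
    L: Callable[[str], bool]   # loading policy

    def __post_init__(self) -> None:
        # Definition 5 requires G to be nonempty.
        if not self.G:
            raise ValueError("G must contain at least one extractor")
```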

Definition 6 Let an extraction policy $E$, an extractor $g$,
and a nonempty seed set $S = \{u_0, \ldots, u_{m-1}\}$ of URLs be
given. We say that an URL $v$ can be extracted from $p_S$ with
respect to $E$ and $g$ if and only if there exists $u \in S$ such that

• $E(p_u)$ is true; and

• $v \in g(p_u)$

Definition 7 An URL $u$ is reachable in one step by crawler
$C = \langle a, A, P, E, G, L \rangle$ with respect to a set of URLs $S$,
written $S \xrightarrow{C} u$, if $u \in S$ or:

• $u$ can be extracted from $p_S$ with respect to $E$ and some
$g \in G$

• $L(u)$ is true; and

• $A(a, u, P) = p_u$

Definition 8 An URL $u$ is reachable by crawler $C$ with respect
to a set of URLs $S$ if $S \overset{C}{\Rightarrow} u$, that is, $(S, u)$ is contained
in the transitive closure of $\xrightarrow{C}$.

The definition of reachable is easily extended to page sets.

Definition 9 Given page sets $p_S, p_T$ then $p_T$ is reachable
from $p_S$ by crawler $C$ if and only if

• $p_S \subseteq p_T$; and

• $\forall p_t \in p_T \; \exists p_s \in p_S$ such that $t$ is reachable by $C$ from $s$

The set of pages reachable by a crawler $C$ is just the maximal
page set in the relation $\overset{C}{\Rightarrow}$.
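
One way to compute the reachable set of Definitions 6 to 8 is a breadth-first traversal from the seed set. The Python sketch below is an illustration under assumptions: the function name reachable and its parameters are hypothetical, the access function is passed with the crawler address and permissions already fixed, and None models both a failed access and an extractor that is undefined on a page.

```python
from collections import deque
from typing import Callable, Iterable, Optional, Set


def reachable(seeds: Iterable[str],
              access: Callable[[str], Optional[str]],
              can_extract: Callable[[str], bool],
              extractors: Iterable[Callable[[str], Optional[Set[str]]]],
              may_load: Callable[[str], bool]) -> Set[str]:
    """Return every URL reachable from the seed set by repeated one-step moves."""
    extractor_list = list(extractors)
    seen: Set[str] = set(seeds)      # every u in S is reachable by definition
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        page = access(u)             # A(a, u, P), with a and P fixed by the caller
        if page is None or not can_extract(page):
            continue                 # no page in hand, or E(p_u) is false
        for g in extractor_list:
            links = g(page)
            if links is None:        # g is undefined on this page
                continue
            for v in links:
                # v is added only if L(v) holds and v can actually be accessed.
                if v not in seen and may_load(v) and access(v) is not None:
                    seen.add(v)
                    queue.append(v)
    return seen
```

Because each URL enters the queue at most once, the traversal terminates on any finite page space even when the link structure contains cycles.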

3 Magi Thin Server 

Our solution expands the Internet search model to bring
scalable, comprehensive, and speedy enterprise-wide
searching to any enterprise device. The cornerstone of
this solution is the deployment of a specialized,
resource-conservative, embedded HTTP server on every mobile,
portable, and computer-enabled device within an enterprise.

Placing a small web server on each and every device
enables existing search engines, without change or modification,
to crawl, index, and search devices and information
pools that were previously inaccessible. The Magi server is
so small that it can be installed on virtually any device,
including personal devices like the Palm Pilot, a WAP phone,
or an embedded microcontroller, and it provides all of the
services required by any Web compliant crawler.

Placing Magi on desktop workstations permits enterprise
search engines to discover, index, and query the content
resident on desktop machines that was, for all intents and
purposes, unknown, uninventoried, and largely inaccessible.
Using their existing search engine an enterprise can, with no
additional effort, discover and access new sources of content
enterprise-wide.

A device-specific Magi server translates device-dependent 
content into Web-standard formats such as HTML or XML. 
Consequently, information and data for which no extractors 
existed can now be crawled and extracted by the crawler of 
any common Web search engine. In effect the Magi server is 
supplying an extractor for the benefit of the crawler. 

A Magi server can offload the task of extracting URLs and 
content from the crawler by creating virtual URLs that exist 
specifically for the benefit of a crawler. When a client (such 
as a crawler) visits such an URL (using the standard HTTP 
protocol so no changes to the crawler are required) the Magi 
server generates a dynamic page that summarizes the content 
of the original page. This reduces network transfers and load 
and improves the incisiveness of the indexing. 

The technique of creating URLs for the benefit of a crawler 
can be further extended to create and push device-specific 
content to a repository or indexer. The Magi server also has 
the ability to authenticate and control access. Let u be an 
URL served by a Magi server M where u denotes content in 
a format for which no crawler extractor exists (for example, 
a device-specific configuration). M creates a virtual URL v 
that when accessed triggers the translation of the content of u 
into standard HTML. In this manner M performs extraction
on behalf of the crawler, giving it access to formats for which
no crawler extractors exist.
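
The mapping from a virtual URL v to translated content can be sketched in Python. Everything named here is hypothetical: the /virtual prefix, the DEVICE_CONTENT table, and the functions translate_to_html and serve are assumptions for illustration, and the source does not describe how a Magi server actually names virtual URLs. In a real server, serve would be called from the HTTP request handler.

```python
from html import escape
from typing import Dict, Optional

# Hypothetical device-specific content: a configuration no crawler can parse.
DEVICE_CONTENT: Dict[str, Dict[str, str]] = {
    "/config/radio": {"channel": "11", "power": "low", "mode": "standby"},
}

VIRTUAL_PREFIX = "/virtual"  # assumed naming scheme for virtual URLs


def translate_to_html(path: str, settings: Dict[str, str]) -> str:
    """Render device-specific settings as a plain HTML page."""
    rows = "".join(
        f"<tr><th>{escape(key)}</th><td>{escape(value)}</td></tr>"
        for key, value in sorted(settings.items())
    )
    return (f"<html><head><title>{escape(path)}</title></head>"
            f"<body><table>{rows}</table></body></html>")


def serve(path: str) -> Optional[str]:
    """Return HTML for a virtual URL v, or None if v maps to no content u."""
    if not path.startswith(VIRTUAL_PREFIX + "/"):
        return None
    original = path[len(VIRTUAL_PREFIX):]   # recover u from v
    settings = DEVICE_CONTENT.get(original)
    if settings is None:
        return None
    return translate_to_html(original, settings)
```

The translation runs only when the virtual URL is requested, so the device pays the conversion cost only for content a crawler actually visits.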

In addition the architecture of the Magi server permits 
both device- and content-specific modules (plugins) that will 
translate non-HTML content into standard HTML for the 
benefit of a crawler. Finally, Magi can be configured as 
a crawler- and indexer-aware server that offers device- and 
content-dependent indexes directly to the crawler. In other 
words, the authentication protocols and access controls of 
Magi allow the Magi server to generate crawler-specific con- 
tent that is optimized for the indexer of the search engine. For 
example, the Magi server executing on a Palm Pilot can gen- 
erate a summary of the Pilot's memopad that is suitable for 
cross-indexing with a departmental project web site. 

The notions of a crawler and indexer can be generalized
considerably if the device server knows of, and cooperates
with, the crawler. The architecture outlined here permits the 
deployment of enterprise- and domain-specific crawlers and 
indexers. Crawlers can be deployed within an enterprise to 
search for a specific form of content, for example, all content 
relating to a specific project. The crawler can move from de- 
vice to device throughout the enterprise network and, with the 
cooperation of the Magi server onboard each device, can be 
served with just the content sought by the crawler, including 
relevant and incisive indices. 

Intermittent connection is, at present, an insurmountable
obstacle for crawlers as the target device (server) may not be 
connected to the network as the crawler is making its rounds. 
We offer several solutions to this problem; their suitability 
depends upon the capabilities of the target device: 

• Magi servers announce their presence to the network 
when the device connects. This event notification is 
propagated to all interested subscribers including, for 
example, the enterprise crawler, thereby allowing the 
crawler to immediately visit the device and push needed 
content back to the enterprise indexer 

• While offline Magi servers can preindex their relevant 
content for the search engine and when connected push 
an index set back to the search engine for inclusion in, 
and integration with, the enterprise index base 

• Magi servers can push their content (all or selected por- 
tions) to the search engine (or to a proxy to deliver to the 
search engine on behalf of the device). This content is 
handed off to the indexer for index generation 

• Alternatively, a Magi server can bleed its content incre- 
mentally to a search engine over the span of multiple 
connections to the network. This strategy might be ap- 
propriate for a device that is connected for only brief 
periods or supports only a low bandwidth connection 
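
The last strategy, delivering content incrementally over several connections, can be sketched in Python as follows. The class IncrementalPusher, its chunk size, and the send callback are hypothetical names introduced for illustration; the sketch assumes the receiver accepts (offset, chunk) pairs and that the device keeps its offset between connections.

```python
from typing import Callable, List


class IncrementalPusher:
    """Sends queued content in small pieces and remembers where it stopped."""

    def __init__(self, items: List[bytes], chunk_size: int) -> None:
        self.data = b"".join(items)
        self.chunk_size = chunk_size
        self.offset = 0  # survives from one connection to the next

    def done(self) -> bool:
        return self.offset >= len(self.data)

    def push_while_connected(self, send: Callable[[int, bytes], bool],
                             budget: int) -> int:
        """Send at most `budget` chunks; stop early if `send` reports failure."""
        sent = 0
        while sent < budget and not self.done():
            chunk = self.data[self.offset:self.offset + self.chunk_size]
            if not send(self.offset, chunk):
                break  # connection lost; resume from the same offset later
            self.offset += len(chunk)
            sent += 1
        return sent
```

The offset advances only after a chunk is acknowledged, so a dropped connection costs at most one retransmitted chunk.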

Because the Magi server on a device can act in an
autonomous fashion it can index the device itself and notify the
central index that the device is connected and/or has an
update for the central index. In this fashion mobile and
intermittently connected devices can be intelligently included in
enterprise data searches. The search engine will not only be able
to find data on all devices in the network, it will also instantly
know the connection status of the device that contained the
data pointed to by the metadata in the index. This opens the
possibility of contacting the device owner in real time to
request that the device with critical data be connected as soon
as possible. Magi can filter responses for sensitive data. This
means that financial, HR, medical, personal, or other sensitive
data can be contained appropriately. The power inherent
in Magi authentication gives great control over how data is
indexed and served, making enterprise-wide indexing a flexible
yet secure tool.



4 Claims

Claim 1 A system for indexing the: 

• Information intake, contents, and output 

• Hardware configuration, settings and status 

• Software configuration, settings and status 

• System and control logs 

• Manner, rate, pattern, and frequency of use 

of any device with one or more embedded digital proces- 
sors including, but not limited to, mobile phones, telephones, 
printers, fax machines, personal digital assistants, portable 
digital devices, digital media players and recorders, appli- 
ances, heating, ventilation, communication, and electrical 
systems, sensors and actuators, automotive electronic and 
mechanical systems, technical, scientific, and medical instru- 
ments, machine tools, and materiel handling, manufacturing, 
assembly, and delivery systems comprising: 

• For each such device a Magi web server resident on the 
device 

• For each such device a network interface for fetch- 
ing HTML pages from the devices in accordance with 
device-specific URLs and URL links 

• For each such device one or more device-specific Magi 
modules to translate device-specific information into 
text, HTML, or XML web pages 

• A Web search engine consisting of a crawler, indexer, 
and query engine 

Claim 2 The system of claim 1 wherein devices whose network
interface is disconnected from the network for extended
periods or whose connection to the network is intermittent are
crawled with content collected in order of decreasing priority
or criticality.

The presence service of Magi transmits a notification to all
subscribers of the appearance (connection) of a Magi-enabled
device. A subscriber may subscribe to presence notification
for a given set of devices $d_1, \ldots, d_n$ or, with sufficient
authority, the presence notification for all devices. In this
manner the crawler can be notified when a specific device of
interest connects to the network or when any device connects.
When a device connects to the network (say, a personal digital
assistant is set into its network and recharging cradle) the
crawler is notified and immediately begins crawling the
device, collecting pages from its resident Magi server. Crawling
continues for as long as the device is connected. Note
that the dynamic DNS service of Magi allows the crawler to
obtain the IP address of the device even if it was dynamically
assigned and changes from one network connection to another.
Commercial crawlers, such as Ultraseek, provide an API for
the implementation of crawling on demand.
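
The presence mechanism described above is essentially publish and subscribe. The Python sketch below is a minimal in-process illustration, not the Magi presence service itself: PresenceService, ALL_DEVICES, and the method names are hypothetical, the authority check for subscribing to all devices is omitted, and a dictionary of current addresses stands in for dynamic DNS.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

ALL_DEVICES = "*"                      # assumed wildcard meaning "every device"
Callback = Callable[[str, str], None]  # receives (device_id, current_address)


class PresenceService:
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)
        self._addresses: Dict[str, str] = {}

    def subscribe(self, device_id: str, callback: Callback) -> None:
        """Register interest in one device, or in ALL_DEVICES."""
        self._subscribers[device_id].append(callback)

    def announce(self, device_id: str, address: str) -> None:
        """Called by a device server when it connects to the network."""
        self._addresses[device_id] = address  # stands in for dynamic DNS
        listeners = self._subscribers[device_id] + self._subscribers[ALL_DEVICES]
        for callback in listeners:
            callback(device_id, address)

    def lookup(self, device_id: str) -> Optional[str]:
        """Most recently announced address of the device, if any."""
        return self._addresses.get(device_id)
```

A crawler would subscribe with a callback that enqueues the device for an immediate crawl at the announced address.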

A more selective form of crawling is possible. In this im- 
plementation the crawler waits for direct contact from the de- 
vice itself in which the device informs the crawler of exactly 
where in the page space of the device to begin the crawl. In 
this manner the device can instruct the crawler to collect just 
those pages that have changed since the device was last vis- 
ited by the crawler. The device-resident Magi server can al- 
ternatively inform the crawler of a set of pages to visit ordered 
by priority. 
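
The list of pages a device hands to the crawler can be sketched in Python. This is an illustration under assumptions: PageInfo and crawl_manifest are hypothetical names, modification times are plain timestamps, and a larger priority value means a more critical page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageInfo:
    path: str
    modified: float  # seconds since the epoch
    priority: int    # larger means more critical (assumed convention)


def crawl_manifest(pages: List[PageInfo], last_visit: float) -> List[str]:
    """Paths changed since `last_visit`, most critical first."""
    changed = [p for p in pages if p.modified > last_visit]
    # Sort by descending priority; ties are broken by path for a stable order.
    changed.sort(key=lambda p: (-p.priority, p.path))
    return [p.path for p in changed]
```

Ordering by priority means that if the device disconnects partway through the crawl, the most critical pages have already been collected.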

Finally, some search engines (Google, for example) 
archive the web pages collected by their crawlers. This fa- 
cility allows a search engine user to view the web pages of a 
device even if it is disconnected since the search engine itself 
has a copy (albeit one that may be out-of-date). 

Claim 3 The system of claim 1 wherein devices whose network
interface is of low bandwidth or unreliable are indexed.

Low bandwidth connections present a particular challenge 
for crawling the contents of "bandwidth-challenged" devices. 
Here the device-resident Magi server can adopt tactics that 
ameliorate the deficiencies of the connection: 

• Assuming that the Magi server and the crawler have 
a transport encoding in common the server can pre- 
compress offline the pages that it wants crawled and in- 
dexed. Using the mechanisms mentioned in Claim 2 it 
can direct the crawler to crawl just those pages for which 
it has precomputed compressed content. In this manner 
the device can make optimal use of its limited bandwidth 

• The Magi server can break its content into small, indi- 
vidual "mini-pages", no one of which will require a sub- 
stantial amount of transmission time. This technique, 
in combination with compression and directed crawling, 
permits incisive, directed "spot crawling" that allows a 
device to collaborate with a crawler even if the device is 
connected for only a brief period 
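
Both tactics can be sketched with the Python standard library. The sketch assumes gzip is the encoding the server and crawler have in common; split_into_mini_pages and precompress are hypothetical names, and splitting on word boundaries is one simple choice among many.

```python
import gzip
from typing import List


def split_into_mini_pages(text: str, max_chars: int) -> List[str]:
    """Split text on word boundaries into pieces of roughly `max_chars`.

    A single word longer than `max_chars` becomes a page of its own.
    """
    pages: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            pages.append(current)
            current = word
        else:
            current = candidate
    if current:
        pages.append(current)
    return pages


def precompress(pages: List[str]) -> List[bytes]:
    """Compress each mini-page offline so no CPU is spent while connected."""
    return [gzip.compress(page.encode("utf-8")) for page in pages]
```

Very short pages may not shrink under gzip because of its fixed header overhead, so a device would likely compress only pages above some size threshold.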

Claim 4 The system of claim 1 wherein only an authorized
Web search engine may extract content from a device.

Using the Magi security and authentication services the 
crawler can authoritatively identify itself to the device allow- 
ing the device to be crawled only by authorized crawlers. 
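
The text does not specify the Magi authentication protocol. As a stand-in, the following Python sketch shows a generic challenge and response exchange based on a pre-shared key and HMAC-SHA256; the function names and the existence of a pre-shared key are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> bytes:
    """Device side: a fresh random nonce for each connection attempt."""
    return secrets.token_bytes(16)


def sign_challenge(shared_key: bytes, challenge: bytes) -> bytes:
    """Crawler side: prove knowledge of the key without transmitting it."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def is_authorized(shared_key: bytes, challenge: bytes, response: bytes) -> bool:
    """Device side: compare expected and received tags in constant time."""
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Because the challenge is fresh for every attempt, a response captured from one connection cannot be replayed to authorize another.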

Claim 5 The system of claim 1 wherein personal, sensitive,
or proprietary information is securely transferred from a
device to an indexer.

In addition the device and the crawler can securely establish
a secret session key known to them alone that permits the
secure transmission of sensitive information from the device
to the crawler.



5 Implementation 

Enterprise-wide indexing and search can be implemented in a
stepwise fashion that allows the enterprise to gradually
extend its reach to an ever broader range of devices, adapting
its crawling, indexing, and search services as need and
circumstances dictate.

Beginning with its desktops an organization can install a
Magi server on each personal workstation. Each such server
is configured with a module that silently indexes the relevant
portions of the desktop file system in the background, as
processor cycles and disk bandwidth become available. The
desktop user need not schedule the server to create or update
the index.

For a very small company without the resources for a so- 
phisticated search system, each Magi server would simply 
serve up its metadata index on the fly as a query came in 
from an authorized (corporate) search engine. For larger or- 
ganizations, or those desiring a high-performance search sys- 
tem, each Magi desktop server would push its metadata to a 
centralized index. The Magi server could be instructed to up- 
date the index at the desired level of granularity, or could be 
polled by the index server if desired. Intermittently connected 
devices would register their presence on the network and au- 
tomatically update the index if their metadata has changed 
since the last connection. The Magi server can authenticate
any connection to the device, hence there is no risk that a
mobile device connected to a foreign (i.e. extra-enterprise)
network will transmit inappropriate information
to a non-authorized system. In a similar fashion the Magi
server can filter access to data or metadata based on
authentication circles.



6 Summary

Corporations will finally be able to realize the benefits of the 
data contained on the edge devices in their networks. As the 
Internet search engines have demonstrated, indexed searching 
makes it possible to quickly find data scattered over a huge 
number of rapidly changing devices. Using Magi technol- 
ogy, corporations can use proven, off-the-shelf, commodity 
components to quickly find any piece of data, on any device, 
in a secure fashion. 



